DETAILED ACTION



1. This communication is in response to Remarks, filed 01/12/07. 

2. Claims 1-4, 6-18, 20, 22 and 35 are pending. Claims 1, 11 and 18 are
independent.

Continued Examination Under 37 CFR 1.114 

3. A request for continued examination under 37 CFR 1.114, including the fee set 
forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this
application is eligible for continued examination under 37 CFR 1.114, and the fee set
forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action 
has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 
01/12/07 has been entered. 

Claim Rejections - 35 USC § 103 

4. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made.

5. Claims 1-4, 6, 9, 11-13, 15-18, 20, 22, 24, 27-31 and 33-35 are rejected under 35
U.S.C. 103(a) as being unpatentable over Anderson (6,499,016) in view of Bernardi et
al. (6,101,338).

As to claim 1, Anderson teaches:

a media capture mechanism, (digital camera, col. 2, lines 55-57); 

an audio input receptive of user speech relating to a media capture activity in 
close temporal relation to the media capture activity, (recording category-specific voice 
annotations on the camera at the time of capture, col. 3, lines 10-14); 

a media tagger adapted to tag captured media with text generated by said
speech recognizer based on close temporal relation between receipt of recognized user
speech and capture of the captured media, (each of the image files the user identified
for voice recognition is processed by translating each of the voice annotations in the
image file into a text annotation, col. 5, lines 31-34. Because each of the images is
tagged with voice annotations within the camera, and each of the voice tags is then
translated into a text annotation, each of the images would necessarily be tagged with
a text annotation. Furthermore, for a system that creates the text annotations once a
connection to the internet is established, where the system contains wireless
capabilities (col. 5, lines 10-12), the text annotations could be applied in close
temporal relation);

a media annotator adapted to annotate the captured media with a sample of the
user speech that is suitable for input to a speech recognizer based on close temporal
relation between receipt of the user speech and capture of the captured media, (a
categorizing system that is able to apply categories along with voice annotations to the
image at the time of capture, col. 4, lines 16-24).

Anderson does not teach:

a plurality of focused speech recognition lexica respectively relating to media 
capture activities; 

a user interface having a menu structure of hierarchically organized media 
capture activities and adapted to permit a user to navigate between and select one of 
the lexica by media capture activity; nor

a speech recognizer adapted to recognize the user speech based on a selected 
one of the focused speech recognition lexica. 

However, Bernardi et al. teach a plurality of available speech commands: a set of
commands is selectable through a user interface, the selected commands act as the
available speech recognition dictionary (lexica), and as different commands are selected
on the display, a new set of commands becomes available to the user. The commands
are recognized by a speech recognizer, and the commands for image capture activities
are organized hierarchically (col. 2, lines 45-60).

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of the invention to combine the methods of Anderson with the selectable speech 
command dictionaries of Bernardi et al. to create a camera with speech recognition 
without having to have the user memorize a set of words or phrases available for 
recognition, as taught by Bernardi et al. (col. 1, lines 30-50).

As to claim 11, Anderson teaches:

a portable media capture device adapted to capture media, (digital camera, col. 2,
lines 55-57);

to receive user speech in close temporal relation to a media capture activity
(recording category-specific voice annotations on the camera at the time of capture,
col. 3, lines 10-14);

adapted to annotate captured media with a sample of the user speech that is
suitable for input to a speech recognizer based on close temporal relation between
receipt of the user speech and capture of the captured media, (each of the image files
the user identified for voice recognition is processed by translating each of the voice
annotations in the image file into a text annotation, col. 5, lines 31-34. Because each
of the images is tagged with voice annotations within the camera, and each of the voice
tags is then translated into a text annotation, each of the images would necessarily be
tagged with a text annotation. Furthermore, for a system that creates the text
annotations once a connection to the internet is established, where the system contains
wireless capabilities (col. 5, lines 10-12), the text annotations could be applied in
close temporal relation);

a post processor adapted to receive annotations from the device, perform speech
recognition on the annotations, and tag related captured media with text generated
during speech recognition performed on the annotations, (receiving annotations and
tags (col. 4, lines 31-33), performing voice recognition within the camera, (col. 6,
lines 48-50)).

Anderson does not teach to permit a user to employ a user interface having a 
menu structure of hierarchically organized media capture activities to navigate between 
and select one of a plurality of focused speech recognition lexica by media capture 
activity, nor performing speech recognition based on a selected one of the focused 
speech recognition lexica that respectively relate to media capture activities. 

However, Bernardi et al. teach a plurality of available speech commands: a set of
commands is selectable through a user interface, the selected commands act as the
available speech recognition dictionary (lexica), and as different commands are selected
on the display, a new set of commands becomes available to the user. The commands
are recognized by a speech recognizer, and the commands for image capture activities
are organized hierarchically (col. 2, lines 45-60).

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of the invention to combine the methods of Anderson with the selectable speech 
command dictionaries of Bernardi et al. to create a camera with speech recognition 
without having to have the user memorize a set of words or phrases available for 
recognition, as taught by Bernardi et al. (col. 1, lines 30-50). 



As to claim 18, Anderson teaches:

capturing media with the media capture device during a media capture activity
conducted by a user of the device, (digital camera for capturing images, col. 2, lines
55-57);

receiving user speech via an audio input of the device in close temporal relation
to the media capture activity, (recording category-specific voice annotations on the
camera at the time of capture, col. 3, lines 10-14; each of the image files the user
identified for voice recognition is processed by translating each of the voice
annotations in the image file into a text annotation, col. 5, lines 31-34. Because each
of the images is tagged with voice annotations within the camera, and each of the voice
tags is then translated into a text annotation, each of the images would necessarily be
tagged with a text annotation. Furthermore, for a system that creates the text
annotations once a connection to the internet is established, where the system contains
wireless capabilities (col. 5, lines 10-12), the text annotations could be applied in
close temporal relation);

annotating captured media by storing the captured media in memory of the 
device in association with a sample of the user speech that is suitable for input to a 
speech recognizer, (a categorizing system that is able to apply categories along with
voice annotations to the image at the time of capture, col. 4, lines 16-24);

recognizing the user speech with a speech recognizer of the device (voice 
annotations are entered for the image captures, where the voice annotations relate to 
categories such as location, history, caption, and occasion, with each category related 
to the process of organizing images, col. 4, lines 15-24);

tagging captured media with recognition text generated during recognition of the
user speech by storing the captured media in memory of the device in association with 
the recognition text, (the voice annotations are translated into text annotations, and 
stored along with the captured image, col. 5, lines 22-27). 

Anderson does not teach permitting a user to navigate a menu structure of
hierarchically organized media capture activities and thereby select by media capture
activity a focused speech recognition lexicon relating to the media capture activity from
a plurality of focused lexica relating to media capture activities that are stored in
memory of the device.

However, Bernardi et al. teach a plurality of available speech commands: a set of
commands is selectable through a user interface, the selected commands act as the
available speech recognition dictionary (lexica), and as different commands are selected
on the display, a new set of commands becomes available to the user. The commands
are recognized by a speech recognizer, and the commands for image capture activities
are organized hierarchically (col. 2, lines 45-60).

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of the invention to combine the methods of Anderson with the selectable speech 
command dictionaries of Bernardi et al. to create a camera with speech recognition 
without having to have the user memorize a set of words or phrases available for 
recognition, as taught by Bernardi et al. (col. 1, lines 30-50).

As to claim 2, Anderson teaches an input receptive of a user identity, wherein
said speech recognizer is adapted to recognize user speech based on the user identity, 
(recognizing the voice annotations based on the user logged into the system, col. 5, 
lines 20-30). 

As to claim 3, Anderson suggests the speech recognizer is adapted to employ
focused lexica based on the user identity (recognizing the voice annotations based on
the user logged into the system, col. 5, lines 20-30). It would be obvious to one of
ordinary skill in the art at the time of the invention that
the voice processing service identifies the user before voice annotations are translated,
and lexica based on that user would be loaded so the voice processing service would
be familiar with the vocabulary of the user, increasing the ability of the voice processing
service to create correct translations of the user's speech.

As to claims 4 and 20, Anderson does not teach wherein said speech recognizer 
is adapted to select a lexicon based on the user speech and a predefined heuristic 
relating to voice tags associated with the lexica. 

However, Bernardi et al. teach a plurality of available speech commands: a set of
commands is selectable through a user interface, the selected commands act as the
available speech recognition dictionary (lexica), and as different commands are selected
on the display, a new set of commands becomes available to the user. The commands
are recognized by a speech recognizer, and the commands for image capture activities
are organized hierarchically (col. 2, lines 45-60).

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of the invention to combine the methods of Anderson with the selectable speech 
command dictionaries of Bernardi et al. to create a camera with speech recognition 
without having to have the user memorize a set of words or phrases available for 
recognition, as taught by Bernardi et al. (col. 1, lines 30-50).

As to claim 6, Anderson teaches a media retrieval mechanism adapted to 
retrieve captured media from memory of the device by matching a tag of the captured 
media to recognition text generated from user speech received and recognized during a 
retrieval mode of the device, (the user is able to select all the images corresponding to
certain keywords such as images comprising tags, such as a search for pictures taken
"Beach" on "Vacation", col. 6, lines 39-42).

As to claim 9, Anderson teaches an external data interface adapted to transmit 
annotations to a post processor having greater speech recognition capabilities than said 
device (uploading images and voice annotations that were not translated on the device, 
to be translated on a server, col. 5, lines 20-35).

As to claim 12, Anderson does not teach a source of predefined, focused lexica
relating to media capture activities and adapted to communicate focused lexica to said 
media capture device according to device type over a communication network. 

However, Bernardi et al. teach a plurality of available speech commands: a set of
commands is selectable through a user interface, the selected commands act as the
available speech recognition dictionary (lexica), and as different commands are selected
on the display, a new set of commands becomes available to the user. The commands
are recognized by a speech recognizer, and the commands for image capture activities
are organized hierarchically (col. 2, lines 45-60).

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of the invention to combine the methods of Anderson with the selectable speech 
command dictionaries of Bernardi et al. to create a camera with speech recognition 
without having to have the user memorize a set of words or phrases available for 
recognition, as taught by Bernardi et al. (col. 1, lines 30-50). 

As to claims 13, 30 and 31, Anderson teaches communicating with a post- 
processor over a communication network, and transferring lexica to and from the device 
(connecting the media device to server with a better ability to translate the voice 
annotations, col. 5, lines 20-35). 

Anderson does not teach a source of predefined, focused lexica relating to media
capture activities.

However, Bernardi et al. teach a plurality of available speech commands: a set of
commands is selectable through a user interface, the selected commands act as the
available speech recognition dictionary (lexica), and as different commands are selected
on the display, a new set of commands becomes available to the user. The commands
are recognized by a speech recognizer, and the commands for image capture activities
are organized hierarchically (col. 2, lines 45-60).

Therefore, it would have been obvious to one of ordinary skill in the art at the 
time of the invention to combine the methods of Anderson with the selectable speech 
command dictionaries of Bernardi et al. to create a camera with speech recognition 
without having to have the user memorize a set of words or phrases available for 
recognition, as taught by Bernardi et al. (col. 1, lines 30-50). 

As to claims 15 and 33, Anderson teaches a mapping module adapted to convert 
textual tags associated with captured media to alternative textual tags based on 
predetermined criteria relating to a media capture activity (converting tags associated 
with the media to alternative tags when the images are uploaded to the server, col. 6, 
lines 35-45). 

As to claim 16, Anderson teaches said device is adapted to perform a relatively 
limited amount of speech recognition of the annotation compared to an amount of 
speech recognition performed by said post-processor (the media device uploads
annotations to a server to do further post-processing of the annotations, col. 6, lines
35-45), the relatively limited amount being limited in at least one of time and search
space due to at least one of lower processing power and relatively lower memory
capacity of said device (if memory allocations permit, limited speech recognition can be
performed on the camera, where it would be obvious to one of ordinary skill in the art at
the time of the invention that less memory and processing power would be available on
the media capture device as compared to the server for translating the annotations),
and to tag related captured media with recognition text generated during the relatively 
limited amount of speech recognition (tagging text with annotation with the speech 
recognition available on the media capture device, col. 6, lines 47-53). 

As to claims 17, 34 and 35, Anderson teaches the post-processor is receptive of 
captured media from said device, and is adapted to organize the captured media 
according to at least one of annotations and textual tags associated with the captured 
media, including clustering at least one of annotations and textual tags based on at 
least one of acoustic similarity measures and semantic similarity measures, (photo 
albums are created by grouping together all the images with text annotations having 
matching keywords, col. 6, lines 39-43). 

As to claim 22, Anderson teaches receiving a user identity, wherein said step of 
recognizing the user speech is based on the user identity (recognizing the voice 
annotations based on the user logged into the system, col. 5, lines 20-30).

As to claim 24, Anderson teaches retrieving captured media from memory of the
device by matching a tag of the captured media to recognition text generated from user
speech received and recognized during a retrieval mode of the device (the user is able
to select all the images corresponding to certain keywords such as images comprising
tags, such as a search for pictures taken "Beach" on "Vacation", col. 6, lines 39-42).

As to claim 27, Anderson suggests lexicon contents and storing the lexicon 
contents in device memory, (a lexicon with commands/categories for categorizing the 
images, col. 4, lines 8-11. It would be necessary that the lexicon be loaded and
stored in the memory of the device, for future use). 

As to claim 28, Anderson teaches transferring annotations from the device to a 
post processor having greater speech recognition capability than the device (the media 
device uploads annotations to a server to do further post-processing of the
annotations, col. 6, lines 35-45). 

As to claim 29, Anderson teaches performing speech recognition on annotations 
received from the device; and tagging related captured media with text generated during 
speech recognition performed on the annotations (performing speech recognition on the 
annotations, and tagging text with annotation with the speech recognition available on 
the media capture device, col. 6, lines 47-53).

6. Claims 7, 8, 10, 14, 23, 25, 26 and 32 are rejected under 35 U.S.C. 103(a) as
being unpatentable over Anderson in view of Bernardi et al. (6,101,338) as applied to
claims 1, 14, 18 and 22 above, and further in view of Li et al. (6,397,181).

As to claims 7 and 25, Anderson and Bernardi et al. do not teach a media 
retrieval mechanism adapted to retrieve captured media from memory of the device by 
matching an annotation of the captured media to user speech received during a retrieval 
mode of the device using sound similarity metrics to align an annotation with a spoken 
query. 

However, Li et al. teach retrieving data based on an annotation of the captured
media, where the input voice is compared, and data is only returned when a level of
similarity is met, (col. 8, lines 15-33).

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to combine the media capture methods of Anderson and Bernardi et al. 
with the language customization of Li et al. to create an improved system of indexing 
and retrieving media content, as taught by Li et al. (col. 1, lines 19-21). 

As to claims 8 and 26, Anderson and Bernardi et al. do not teach a lexicon editor 
adapted to supplement a lexicon based on an annotation, letter to sound rules, and user 
speech corresponding to spelled word input received and recognized during a lexicon 
edit mode of the device.

However, Li et al. teach a sample language defined in Backus-Naur Form
(BNF), where the BNF can be customized to a specific topic or a specific
speaker/narrator, (col. 4, lines 54-67). It would be obvious to one of ordinary skill in the
art at the time of the invention that since the sample language is customizable, the
language would be editable based on letter to sound rules, and user speech
corresponding to a spelled word input (col. 5, lines 23-30, and 40-47), to decrease the
amount of memory needed to store the recognition language, and to increase the
flexibility and improve the method of indexing the media, (col. 1, lines 20-21).

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to combine the media capture methods of Anderson and Bernardi et al. 
with the language customization of Li et al. to create an improved system of indexing 
and retrieving media content, as taught by Li et al. (col. 1, lines 19-21).

As to claim 10, Anderson teaches: 

an external data interface receptive of lexicon contents (connecting the media 
device to server with a better ability to translate the voice annotations, col. 5, lines 20- 
35). 

Anderson and Bernardi et al. do not teach a lexicon editor adapted to store the 
lexicon contents in device memory. 

However, Li et al. teach a sample language defined in Backus-Naur Form
(BNF), where the BNF can be customized to a specific topic or a specific
speaker/narrator, (col. 4, lines 54-67).

Therefore it would have been obvious to one of ordinary skill in the art at the time
of the invention to combine the media capture methods of Anderson and Bernardi et al.
with the language customization of Li et al. to create an improved system of indexing
and retrieving media content, as taught by Li et al. (col. 1, lines 19-21).

As to claim 14, Anderson and Bernardi et al. do not teach a lexicon editor 
provided to at least one of the device and the post processor and adapted to customize 
a focused lexicon for a user of the device. 

However, Li et al. teach the system is able to customize the sample language, by 
defining the language to a specific topic, or a specific speaker/narrator, (col. 4, lines 54- 
67). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to combine the media capture methods of Anderson and Bernardi et al. 
with the language customization of Li et al. to create an improved system of indexing 
and retrieving media content, as taught by Li et al. (col. 1, lines 19-21). 

As to claim 23, Anderson and Bernardi et al. do not teach selecting, based on the
user identity, a focused speech recognition lexicon relating to the media capture activity
from a plurality of focused lexica relating to media capture activities that are stored in
memory of the device.

However, Li et al. teach a plurality of focused speech recognition lexica
respectively relating to media capture activities (a sample language defined in
Backus-Naur Form (BNF), where the BNF can be customized to a specific topic or a specific
speaker/narrator, col. 4, lines 54-67). It would be obvious to one of ordinary skill in the
art at the time of the invention that since the sample language is customizable, there
would be the ability to have a plurality of focused speech recognition languages
available. Furthermore, it would be obvious to one of ordinary skill in the art at the time
of the invention that since the user is able to customize the sample language, the user
would customize the sample language depending on the necessary elements of the media
device, to increase the flexibility and improve the method of indexing the media, (col. 1,
lines 20-21).

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to combine the media capture methods of Anderson and Bernardi et al. 
with the language customization of Li et al. to create an improved system of indexing 
and retrieving media content, as taught by Li et al. (col. 1, lines 19-21). 

As to claim 32, Anderson and Bernardi et al. do not teach customizing a focused 
lexicon for a user of the device. 

However, Li et al. teach customizing the language by customizing the topic or a 
specific speaker/narrator, (col. 4, lines 63-67). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to combine the media capture methods of Anderson and Bernardi et al. 
with the language customization of Li et al. to create an improved system of indexing 
and retrieving media content, as taught by Li et al. (col. 1, lines 19-21). 



Application/Control Number: 10/677,174 
Art Unit: 2626

Response to Arguments 

7. Applicant's arguments with respect to claims 1, 11 and 18 have been considered
but are moot in view of the new ground(s) of rejection. 

Conclusion 

8. The prior art made of record and not relied upon is considered pertinent to 
applicant's disclosure. See PTO-892. 



9. Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Thomas E. Shortledge whose telephone number is 
(571)272-7612. The examiner can normally be reached on M-F 8:00 - 4:30. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's
supervisor, Richemond Dorvil, can be reached on (571)272-7602. The fax phone
number for the organization where this application or proceeding is assigned is
571-273-8300.

Information regarding the status of an application may be obtained from the
Patent Application Information Retrieval (PAIR) system. Status information for
published applications may be obtained from either Private PAIR or Public PAIR.
Status information for unpublished applications is available through Private PAIR only.
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 



TS 

03/23/07 




